A German financial company wants to create a model that predicts the
defaults on consumer loans in the German market. When you are called in,
the company has already built a model and asks you to evaluate it
because there is a concern that this model unfairly evaluates young
customers. Your task is to figure out if this is true and to devise a
way to correct this problem. The data used to make predictions as well
as the predictions can be found in germancredit data.
The data contains the outcome of interest BAD indicating
whether a customer has defaulted on a loan. A model to predict default
has already been fit and predicted probabilities of default
(probability) and predicted status coded as yes/no for
default (predicted) have been concatenated to the original
data.
Let’s look at the predictions made by the fitted model.
Here is the confusion matrix:
##
## model_pred BAD GOOD
## PredYesDefault 121 57
## PredNoDefault 179 643
| | Actual 1 | Actual 0 |
|---|---|---|
| Pred 1 | TP = 121 | FP = 57 |
| Pred 0 | FN = 179 | TN = 643 |
It looks OK.
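The headline rates can be computed directly from the counts in this table. A quick base-R sketch (counts hard-coded from the matrix above, with BAD as the positive class):

```r
# Overall metrics from the confusion matrix above
# (counts hard-coded from the printed table; BAD = positive class)
TP <- 121; FP <- 57; FN <- 179; TN <- 643

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)

round(c(accuracy = accuracy, precision = precision, sensitivity = sensitivity), 3)
# accuracy 0.764, precision 0.680, sensitivity 0.403
```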
Here is the ROC curve:
library(pROC)
roc_score <- roc(germancredit$BAD, germancredit$probability)  # also computes the AUC
plot(roc_score, main = "ROC curve -- Logistic Regression")
Again, pretty good.
The decile plot looks decent as well.
library(dplyr)

# Compute response counts by decile of the predicted probability
lift <- function(depvar, predcol, groups = 10) {
  if (is.factor(depvar)) depvar <- as.integer(as.character(depvar))
  if (is.factor(predcol)) predcol <- as.integer(as.character(predcol))
  helper <- data.frame(depvar, predcol)
  # bucket 1 holds the highest predicted probabilities
  helper[, "bucket"] <- ntile(-helper[, "predcol"], groups)
  gaintable <- helper %>%
    group_by(bucket) %>%
    summarise_at(vars(depvar), list(total = ~n(),
                                    totalresp = ~sum(., na.rm = TRUE))) %>%
    mutate(Cumresp = cumsum(totalresp),
           Gain = Cumresp / sum(totalresp) * 100,
           Cumlift = Gain / (bucket * (100 / groups)))
  return(gaintable)
}

default <- 1 * (germancredit$BAD == "BAD")
revP <- germancredit$probability
dt <- lift(default, revP, groups = 10)
barplot(dt$totalresp / dt$total, ylab = "Default rate", xlab = "Decile bucket")
abline(h = mean(default), lty = 2, col = "red")  # overall default rate
The plot does show a slightly concerning bar on the right, but nothing obvious.
So overall, if you just used the traditional evaluation method, you would conclude that there is no problem.
Now, let’s look at this by age and gender.
germancredit$Age_cat <- cut(germancredit$Age, c(0, 25, 35, 45, 75))
germancredit$FemaleAge_cat <- germancredit$Female:germancredit$Age_cat
ggplot(germancredit) + geom_bar() + aes(x = Age_cat, fill = BAD) + facet_grid(~Female)
You can see that more females in the 25-45 range are represented in the dataset.
The proportion of females that actually defaulted is lower than that of males.
The mosaic plot shows some discrepancy in the default probability by age group.
The confusion matrix conditioned on age:
## , , = (0,25]
##
##
## model_pred BAD GOOD
## PredYesDefault 36 15
## PredNoDefault 44 95
##
## , , = (25,35]
##
##
## model_pred BAD GOOD
## PredYesDefault 47 25
## PredNoDefault 71 255
##
## , , = (35,45]
##
##
## model_pred BAD GOOD
## PredYesDefault 15 7
## PredNoDefault 40 164
##
## , , = (45,75]
##
##
## model_pred BAD GOOD
## PredYesDefault 23 10
## PredNoDefault 24 129
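The error rates implied by these tables can be computed directly (counts hard-coded from the matrices above; BAD is treated as the positive class):

```r
# Per-age-group error rates from the conditional confusion matrices above
# (counts hard-coded; BAD = positive class)
groups <- c("(0,25]", "(25,35]", "(35,45]", "(45,75]")
TP <- c(36, 47, 15, 23); FP <- c(15, 25, 7, 10)
FN <- c(44, 71, 40, 24); TN <- c(95, 255, 164, 129)

FPR <- FP / (FP + TN)  # actual non-defaulters wrongly flagged as defaulters
FNR <- FN / (TP + FN)  # actual defaulters the model misses

round(rbind(FPR = setNames(FPR, groups), FNR = setNames(FNR, groups)), 3)
# FPR: 0.136 0.089 0.041 0.072 -- young non-defaulters are flagged
# noticeably more often than any older group
```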
In the US, judges, probation officers, and parole officers use algorithms to evaluate the likelihood of a criminal defendant re-offending, a concept commonly referred to as recidivism. Numerous risk assessment algorithms are circulating with two prominent nationwide tools provided by commercial vendors.
One of these tools, Northpointe’s COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), has made national headlines about how it seems to have a bias towards certain protected groups. Your job is to figure out if this is the case.
https://github.com/propublica/compas-analysis/
(The compas data ships with several R packages, including fairmodels, fairness, and mlr3fairness; the first match is used here.)
The variable of interest is the two_year_recid,
indicating if the individual committed a crime within two years.
##
## 0 1
## 3363 2809
There are two scores for recidivism risk. The first is a categorical score (Low/Medium/High):
##
## 0 1
## Low 2345 1076
## Medium 721 886
## High 297 847
The other is the decile score (1-10):
##
## 0 1
## 1 1009 277
## 2 558 264
## 3 403 244
## 4 375 291
## 5 302 280
## 6 221 308
## 7 198 298
## 8 118 302
## 9 120 300
## 10 59 245
If you look at the risk and the outcome by race you can see
discrepancies.
How would you evaluate the COMPAS result?
This dataset is used to predict whether income exceeds $50K/yr based on census data; it is also known as the “Census Income” dataset. The training set contains 13 features and 30178 observations, and the test set contains 13 features and 15315 observations. The target column is “target”: a binary factor where 1 means <=50K and 2 means >50K annual income. The column “sex” is set as a protected attribute.
Here are the EDA results.
Researchers want to know who makes more money, so they fit a logistic regression.
What do the residuals look like?
How about a decile plot?
What is the conclusion? Is there a problem?
The diabetes dataset describes the clinical care at 130 US hospitals and integrated delivery networks from 1999 to 2008. The classification task is to predict whether a patient will readmit within 30 days.
https://fairlearn.org/main/user_guide/datasets/diabetes_hospital_data.html https://www.hindawi.com/journals/bmri/2014/781670/
We grabbed the preprocessed data so you don’t need to clean it.
The target is readmit_30_days, which is a binary
attribute that indicates whether the patient was readmitted within 30
days.
##
## 0 1
## 90409 11357
The researchers fit a glm model.
ROC
library(pROC)
roc_score <- roc(diabetic$readmit_30_days, diabetes_glm_model$fitted)  # also computes the AUC
plot(roc_score, main = "ROC curve -- Logistic Regression")
What do the residuals look like?
The confusion matrix is not useful here: with the default cutoff of 0.5, everyone is predicted as 0.
##
## 0 1
## 0 90409 11357
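To see why, and what a better cutoff buys you, here is a self-contained simulation (synthetic data, not the diabetes model itself) showing that cutting at the outcome prevalence instead of 0.5 yields a usable confusion matrix for a rare outcome:

```r
# Sketch: with a rare outcome, cutting at the prevalence instead of 0.5
# produces a non-degenerate confusion matrix (simulated data for illustration)
set.seed(1)
y     <- rbinom(5000, 1, 0.11)  # ~11% positives, like 30-day readmission
p_hat <- plogis(qlogis(0.11) + 0.8 * y + rnorm(5000, 0, 0.5))  # weak predictive scores

table(pred = as.integer(p_hat > 0.5), y)      # almost everyone predicted 0
table(pred = as.integer(p_hat > mean(y)), y)  # both rows populated
```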
How about a decile plot?
You see that the model is capturing something. Do you see any problem with this model with protected attributes such as race and gender?
##
## Call:
## glm(formula = readmit_30_days ~ race + gender + age + discharge_disposition_id +
## admission_source_id + time_in_hospital + medical_specialty +
## num_lab_procedures + num_procedures + num_medications + primary_diagnosis +
## number_diagnoses + max_glu_serum + A1Cresult + insulin +
## change + diabetesMed + medicare + medicaid + had_emergency +
## had_inpatient_days + had_outpatient_days, family = binomial(link = "logit"),
## data = diabetic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8673783 0.1459807 -19.642 < 2e-16
## raceAsian 0.0274957 0.1346293 0.204 0.838172
## raceCaucasian 0.0106417 0.0268911 0.396 0.692303
## raceHispanic -0.0175268 0.0773179 -0.227 0.820669
## raceOther -0.0879814 0.0916783 -0.960 0.337219
## raceUnknown -0.1498314 0.0809323 -1.851 0.064124
## genderMale 0.0265442 0.0204684 1.297 0.194686
## genderUnknown/Invalid -7.9067147 68.2352948 -0.116 0.907752
## age30-60 years -0.1236817 0.0700715 -1.765 0.077551
## ageOver 60 years -0.0343349 0.0708234 -0.485 0.627821
## discharge_disposition_idOther 0.3285373 0.0221572 14.828 < 2e-16
## admission_source_idOther -0.1685689 0.0369529 -4.562 5.07e-06
## admission_source_idReferral -0.0090589 0.0259570 -0.349 0.727091
## time_in_hospital 0.0092932 0.0039415 2.358 0.018386
## medical_specialtyEmergency/Trauma 0.1283383 0.0660545 1.943 0.052027
## medical_specialtyFamily/GeneralPractice 0.2716233 0.0641085 4.237 2.27e-05
## medical_specialtyInternalMedicine 0.2318526 0.0592251 3.915 9.05e-05
## medical_specialtyMissing 0.2073489 0.0546659 3.793 0.000149
## medical_specialtyOther 0.1949292 0.0583000 3.344 0.000827
## num_lab_procedures 0.0006204 0.0006072 1.022 0.306885
## num_procedures -0.0212016 0.0071067 -2.983 0.002851
## num_medications 0.0056286 0.0016258 3.462 0.000536
## primary_diagnosisGenitourinary Issues -0.2115712 0.0571419 -3.703 0.000213
## primary_diagnosisMusculoskeletal Issues -0.2386157 0.0623376 -3.828 0.000129
## primary_diagnosisOther -0.1029096 0.0365812 -2.813 0.004905
## primary_diagnosisRespiratory Issues -0.3141149 0.0448921 -6.997 2.61e-12
## number_diagnoses 0.0319274 0.0061935 5.155 2.54e-07
## max_glu_serum>300 0.0743506 0.1143393 0.650 0.515523
## max_glu_serumNone -0.0656577 0.0850484 -0.772 0.440112
## max_glu_serumNorm -0.0210623 0.1014167 -0.208 0.835478
## A1Cresult>8 -0.0410960 0.0666580 -0.617 0.537551
## A1CresultNone 0.0922931 0.0561026 1.645 0.099954
## A1CresultNorm -0.0541563 0.0729656 -0.742 0.457956
## insulinNo -0.2503349 0.0399936 -6.259 3.87e-10
## insulinSteady -0.2004377 0.0364121 -5.505 3.70e-08
## insulinUp -0.0938975 0.0388584 -2.416 0.015675
## changeNo 0.0990523 0.0287149 3.450 0.000562
## diabetesMedYes 0.1496604 0.0327180 4.574 4.78e-06
## medicareTRUE -0.0850352 0.0234337 -3.629 0.000285
## medicaidTRUE -0.0181985 0.0554751 -0.328 0.742875
## had_emergencyTRUE 0.2931558 0.0293433 9.991 < 2e-16
## had_inpatient_daysTRUE 0.6270787 0.0211942 29.587 < 2e-16
## had_outpatient_daysTRUE 0.0635901 0.0266908 2.382 0.017197
##
## (Intercept) ***
## raceAsian
## raceCaucasian
## raceHispanic
## raceOther
## raceUnknown .
## genderMale
## genderUnknown/Invalid
## age30-60 years .
## ageOver 60 years
## discharge_disposition_idOther ***
## admission_source_idOther ***
## admission_source_idReferral
## time_in_hospital *
## medical_specialtyEmergency/Trauma .
## medical_specialtyFamily/GeneralPractice ***
## medical_specialtyInternalMedicine ***
## medical_specialtyMissing ***
## medical_specialtyOther ***
## num_lab_procedures
## num_procedures **
## num_medications ***
## primary_diagnosisGenitourinary Issues ***
## primary_diagnosisMusculoskeletal Issues ***
## primary_diagnosisOther **
## primary_diagnosisRespiratory Issues ***
## number_diagnoses ***
## max_glu_serum>300
## max_glu_serumNone
## max_glu_serumNorm
## A1Cresult>8
## A1CresultNone .
## A1CresultNorm
## insulinNo ***
## insulinSteady ***
## insulinUp *
## changeNo ***
## diabetesMedYes ***
## medicareTRUE ***
## medicaidTRUE
## had_emergencyTRUE ***
## had_inpatient_daysTRUE ***
## had_outpatient_daysTRUE *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 71205 on 101765 degrees of freedom
## Residual deviance: 68916 on 101723 degrees of freedom
## AIC: 69002
##
## Number of Fisher Scoring iterations: 9
Choose one of the datasets described above as your target problem.
Please write your answer in full sentences.
Diabetes
Discuss what a favorable label in this problem is and what a favorable label grants the affected user. Is it assistive or punitive?
Please write your answer in full sentences.
The readmission label is the favorable label. It is assistive if patients can learn which hospitals have lower readmission rates, which helps them decide which hospital to go to.
What type of justice is this issue about?
Please write your answer in full sentences.
Procedural justice. From the hospital’s perspective, the patient’s readmission label might lead to an inappropriate decision strategy: a hospital might decide whether or not to admit a patient in order to keep its 30-day readmission rate below the threshold and avoid penalties from the government.
Discuss the potential concerns about the data being used.
Please write your answer in full sentences.
Patients can choose their hospital based on the hospital’s readmission rate.
Hospitals can use patients’ predicted labels to control their overall readmission rate.
Discuss what type of group fairness metrics is appropriate for this problem.
Please write your answer in full sentences.
Separation. Since the label might lead to external human intervention in the outcome, the prediction $\hat{Y}$ will no longer be independent of $Z$ given $Y$.
Using the appropriate fairness metrics, show if there are concerns in the prediction algorithm.
Please write your answer in full sentences.
Given that you have access to the original data, but not to the model used to make the prediction, discuss which mitigation strategy might be more appropriate to deal with the problem, if any.
Please write your answer in full sentences.
There are several ways to classify fairness metrics. Many fairness metrics for discrete outcomes are derived from the conditional confusion matrix. For each protected group of interest, we can define a conditional confusion matrix as:
| | Actual 1 | Actual 0 | \(\dots\) | Actual 1 | Actual 0 |
|---|---|---|---|---|---|
| Pred 1 | \(TP_{g1}\) | \(FP_{g1}\) | \(\dots\) | \(TP_{g2}\) | \(FP_{g2}\) |
| Pred 0 | \(FN_{g1}\) | \(TN_{g1}\) | \(\dots\) | \(FN_{g2}\) | \(TN_{g2}\) |
Depending on the context different metrics are appropriate.
Demographic parity is one of the most popular fairness indicators in the literature.
Demographic parity is achieved if the absolute number of positive predictions in the subgroups are close to each other. \[(TP_g + FP_g)\] This measure does not take true class into consideration and only depends on the model predictions. In some literature, demographic parity is also referred to as statistical parity or independence.
## (0,25] (25,35] (35,45] (45,75]
## Positively classified 51 72.000000 22.0000000 33.0000000
## Demographic Parity 1 1.411765 0.4313725 0.6470588
## Group size 190 398.000000 226.0000000 186.0000000
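These values can be reproduced by hand from the conditional confusion matrices above: the positive predictions in each group are \(TP_g + FP_g\), then each group is divided by the (0,25] reference group. A sketch with the counts hard-coded:

```r
# Reproduce the demographic parity table from the per-age confusion matrices
TP <- c("(0,25]" = 36, "(25,35]" = 47, "(35,45]" = 15, "(45,75]" = 23)
FP <- c(15, 25, 7, 10)

positives <- TP + FP               # 51 72 22 33, as reported above
parity <- positives / positives[1]  # relative to the (0,25] reference group
round(parity, 6)
# 1.000000 1.411765 0.431373 0.647059
```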
Of course, comparing the absolute number of positive predictions will show a large disparity whenever the numbers of cases in the groups differ, since unequal group sizes artificially boost the disparity. This is true in our case:
##
## Female Male
## 690 310
Proportional parity is calculated by comparing the proportion of positively classified individuals in each subgroup of the data. \[(TP_g + FP_g) / (TP_g + FP_g + TN_g + FN_g)\] Proportional parity is very similar to demographic parity but corrects for the fact that disparities are artificially boosted when group sizes differ. In some literature, proportional parity and demographic parity are considered equivalent, which is true when the protected group sizes are equal. Proportional parity is achieved if the proportions of positive predictions in the subgroups are close to each other. Like demographic parity, this measure does not depend on the true labels.
In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their proportion of positively predicted observations are lower or higher compared to the reference group. Lower proportions will be reflected in numbers lower than 1 in the returned named vector.
## (0,25] (25,35] (35,45] (45,75]
## Proportion 0.2684211 0.1809045 0.09734513 0.1774194
## Proportional Parity 1.0000000 0.6739580 0.36265834 0.6609741
## Group size 190.0000000 398.0000000 226.00000000 186.0000000
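Again the reported values follow directly from counts already shown: positive predictions divided by group size, then normalized by the (0,25] reference group. A sketch:

```r
# Reproduce proportional parity: positive predictions / group size
positives  <- c(51, 72, 22, 33)   # TP_g + FP_g per age group
group_size <- c(190, 398, 226, 186)

prop <- positives / group_size    # 0.2684 0.1809 0.0973 0.1774
round(prop / prop[1], 6)          # parity relative to (0,25]
# 1.000000 0.673958 0.362658 0.660974
```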
Predictive rate parity is achieved if the precisions (or positive predictive values) in the subgroups are close to each other. The precision stands for the number of the true positives divided by the total number of examples predicted positive within a group. \[TP_g / (TP_g + FP_g)\]
## (0,25] (25,35] (35,45] (45,75]
## Precision 0.2941176 0.3472222 0.3181818 0.3030303
## Predictive Rate Parity 1.0000000 1.1805556 1.0818182 1.0303030
## Group size 190.0000000 398.0000000 226.0000000 186.0000000
The first row shows the raw precision values for the age groups. The second row displays the relative precisions compared to the 0-25 age group.
In a perfect world, all predictive rate parities should be equal to one, which would mean that precision in every group is the same as in the base group. In practice, values are going to be different. The parity above one indicates that precision in this group is relatively higher, whereas a lower parity implies a lower precision. Observing a large variance in parities should hint that the model is not performing equally well for different age groups.
The result suggests that the model is worse for younger people: there are more cases where the model mistakenly predicts that a young person will default.
If the middle-aged group is set as the base group, the raw precision values do not change; only the relative metrics change.
## (25,35] (0,25] (35,45] (45,75]
## Precision 0.3472222 0.2941176 0.3181818 0.3030303
## Predictive Rate Parity 1.0000000 0.8470588 0.9163636 0.8727273
## Group size 398.0000000 190.0000000 226.0000000 186.0000000
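A caveat worth flagging: the precision values above are recovered from the conditional confusion matrices only when the GOOD outcomes among the predicted defaulters are used as the numerator (for example \(15/51 \approx 0.294\) for the (0,25] group). In other words, the package appears to treat GOOD as the positive class here, so the label coding passed to it is worth double-checking. The arithmetic:

```r
# Precision per age group as reported by the package.
# The reported values match GOOD-among-predicted-default counts
# (15, 25, 7, 10) divided by all predicted defaults (51, 72, 22, 33),
# i.e. the package appears to treat GOOD as the positive outcome here.
good_pred_yes <- c(15, 25, 7, 10)
pred_yes      <- c(51, 72, 22, 33)

precision <- good_pred_yes / pred_yes
round(precision, 7)
# 0.2941176 0.3472222 0.3181818 0.3030303
round(precision / precision[1], 7)  # parity vs (0,25]
# 1.0000000 1.1805556 1.0818182 1.0303030
```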
False negative rates are calculated by the division of false negatives with all positives (irrespective of predicted values). \[FN_g / (TP_g + FN_g)\] False negative rate parity is achieved if the false negative rates (the ratio between the number of false negatives and the total number of positives) in the subgroups are close to each other.
In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their false negative rates are lower or higher compared to the reference group. Lower false negative error rates will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean BETTER prediction for the subgroup.
## (0,25] (25,35] (35,45] (45,75]
## FNR 0.8636364 0.9107143 0.9590643 0.9280576
## FNR Parity 1.0000000 1.0545113 1.1104955 1.0745930
## Group size 190.0000000 398.0000000 226.0000000 186.0000000
#### False positive rate parity [Chouldechova 2017]
False positive rates are calculated by the division of false positives with all negatives (irrespective of predicted values). \[FP_g / (TN_g + FP_g)\] False positive rate parity is achieved if the false positive rates (the ratio between the number of false positives and the total number of negatives) in the subgroups are close to each other.
In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their false positive rates are lower or higher compared to the reference group. Lower false positives error rates will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean BETTER prediction for the subgroup.
## (0,25] (25,35] (35,45] (45,75]
## FPR 0.45 0.3983051 0.2727273 0.4893617
## FPR Parity 1.00 0.8851224 0.6060606 1.0874704
## Group size 190.00 398.0000000 226.0000000 186.0000000
Equalized odds is calculated by dividing the true positives by all actual positives (irrespective of predicted values). \[TP_g / (TP_g + FN_g)\] This metric is what is traditionally known as sensitivity.
In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their sensitivities are lower or higher compared to the reference group. Lower sensitivities will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup. Equalized odds are achieved if the sensitivities in the subgroups are close to each other.
## (0,25] (25,35] (35,45] (45,75]
## Sensitivity 0.1363636 0.08928571 0.04093567 0.07194245
## Equalized odds 1.0000000 0.65476190 0.30019493 0.52757794
## Group size 190.0000000 398.00000000 226.00000000 186.00000000
Accuracy metrics are calculated by the division of correctly predicted observations (the sum of all true positives and true negatives) with the number of all predictions. \[(TP_g + TN_g) / (TP_g + FP_g + TN_g + FN_g)\] Accuracy parity is achieved if the accuracies (all accurately classified examples divided by the total number of examples) in the subgroups are close to each other.
In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their accuracies are lower or higher compared to the reference group. Lower accuracies will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.
## (0,25] (25,35] (35,45] (45,75]
## Accuracy 0.3105263 0.2412060 0.2079646 0.1827957
## Accuracy Parity 1.0000000 0.7767652 0.6697165 0.5886641
## Group size 190.0000000 398.0000000 226.0000000 186.0000000
Negative predictive value parity can be considered the ‘inverse’ of the predictive rate parity. Negative predictive values are calculated by dividing the true negatives by all predicted negatives. \[TN_g / (TN_g + FN_g)\] Negative predictive value parity is achieved if the negative predictive values in the subgroups are close to each other.
In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their negative predictive values are lower or higher compared to the reference group. Lower negative predictive values will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.
## (0,25] (25,35] (35,45] (45,75]
## NPV 0.3165468 0.2177914 0.1960784 0.1568627
## NPV Parity 1.0000000 0.6880229 0.6194296 0.4955437
## Group size 190.0000000 398.0000000 226.0000000 186.0000000
The Matthews correlation coefficient (MCC) considers all four cells of the confusion matrix. MCC is sometimes referred to as the single most powerful metric in binary classification problems, especially for data with class imbalances.
\[(TP_g×TN_g-FP_g×FN_g)/\sqrt{((TP_g+FP_g)×(TP_g+FN_g)×(TN_g+FP_g)×(TN_g+FN_g))}\]
In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their Matthews correlation coefficients are lower or higher compared to the reference group. Lower coefficients will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.
## (0,25] (25,35] (35,45] (45,75]
## MCC -0.3494421 -0.3666323 -0.3355449 -0.4748169
## MCC Parity 1.0000000 1.0491931 0.9602303 1.3587854
## Group size 190.0000000 398.0000000 226.0000000 186.0000000
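As a worked example, the (0,25] value can be reproduced from that group's conditional confusion matrix, using the same class orientation as the package output (GOOD counted as the positive class, hence TP = 15, FP = 36, FN = 95, TN = 44):

```r
# MCC for the (0,25] group, using the cell counts that reproduce the
# package's reported value
TP <- 15; FP <- 36; FN <- 95; TN <- 44

mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
round(mcc, 7)
# -0.3494421
```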
Specificity parity can be considered the ‘inverse’ of the equalized
odds. Specificity is calculated by the division of true negatives with
all negatives (irrespective of predicted values). \[TN_g / (TN_g + FP_g)\]
Specificity parity is achieved if the specificity (the ratio of the
number of the true negatives and the total number of negatives) in the
subgroups are close to each other.
In the returned named vector, the reference group will be assigned 1, while all other groups will be assigned values according to whether their specificity is lower or higher compared to the reference group. Lower specificity will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.
## (0,25] (25,35] (35,45] (45,75]
## Specificity 0.55 0.6016949 0.7272727 0.5106383
## Specificity Parity 1.00 1.0939908 1.3223140 0.9284333
## Group size 190.00 398.0000000 226.0000000 186.0000000
The equality of the area under the ROC for different groups identified by protected attributes can be seen as analogous to the equality of accuracy.
This function computes the ROC AUC values for each subgroup. In the returned table, the reference group will be assigned 1, while all other groups will be assigned values according to whether their ROC AUC values are lower or higher compared to the reference group. Lower ROC AUC will be reflected in numbers lower than 1 in the returned named vector, thus numbers lower than 1 mean WORSE prediction for the subgroup.
This function calculates ROC AUC and visualizes ROC curves for all subgroups. Note that probabilities must be defined for this function. Also, as ROC evaluates all possible cutoffs, the cutoff argument is excluded from this function.
## (0,25] (25,35] (35,45] (45,75]
## ROC AUC 0.7389773 0.7820218 0.8137161 0.8152457
## ROC AUC Parity 1.0000000 1.0582488 1.1011382 1.1032080
## Group size 190.0000000 398.0000000 226.0000000 186.0000000
Apart from the standard outputs, the function also returns ROC curves
for each of the subgroups.
A handful of software packages have become available over the last few years. These usually combine fairness metric calculations with visualizations.
Because they automate the process, they are useful if you can get them to work. Here is an example using fairmodels. We will look at the germancredit data again, but here we will fit our own models. For comparison, let's fit a logistic regression and a random forest model.
You need to create an explainer object.
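The chunks that produced the output below are not echoed, so here is a hedged sketch of what the fairmodels workflow looks like. The model formulas, label coding, and privileged group are assumptions, not taken from the source (the explainer output below shows a mean prediction of about 0.7, consistent with GOOD coded as 1):

```r
# A sketch of the fairmodels workflow (illustrative details, see lead-in)
library(DALEX)
library(ranger)
library(fairmodels)

# drop the earlier prediction columns so they are not used as predictors
dat <- subset(germancredit, select = -c(probability, predicted))
y   <- as.numeric(dat$BAD == "GOOD")  # assumption: GOOD coded as 1

glm_model    <- glm(BAD ~ ., data = dat, family = binomial)
ranger_model <- ranger(BAD ~ ., data = dat, probability = TRUE)

explainer_glm    <- explain(glm_model,    data = dat, y = y, label = "lm")
explainer_ranger <- explain(ranger_model, data = dat, y = y, label = "ranger")

# fairness check for one or several explainers against a protected attribute
fc <- fairness_check(explainer_glm, explainer_ranger,
                     protected  = dat$Age_cat,
                     privileged = "(45,75]")
print(fc)
plot(fc)
```

Passing a single explainer to `fairness_check()` gives the one-model output; passing several compares them, as in the outputs below.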
## Preparation of a new explainer is initiated
## -> model label : lm ( default )
## -> data : 1000 rows 21 cols
## -> target variable : 1000 values
## -> predict function : yhat.glm will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package stats , ver. 4.3.1 , task classification ( default )
## -> predicted values : numerical, min = 0.05264789 , mean = 0.7 , max = 0.9983644
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.9774084 , mean = 6.610203e-13 , max = 0.9192795
## A new explainer has been created!
## Preparation of a new explainer is initiated
## -> model label : ranger ( default )
## -> data : 1000 rows 21 cols
## -> target variable : 1000 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.16.0 , task classification ( default )
## -> predicted values : numerical, min = 0.1094149 , mean = 0.6964699 , max = 0.9949683
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.6977659 , mean = 0.003530108 , max = 0.5061689
## A new explainer has been created!
You can run a fairness check on a single model, which shows:
## Creating fairness classification object
## -> Privileged subgroup : character ( Ok )
## -> Protected variable : factor ( Ok )
## -> Cutoff values for explainers : 0.5 ( for all subgroups )
## -> Fairness objects : 0 objects
## -> Checking explainers : 1 in total ( compatible )
## -> Metric calculation : 13/13 metrics calculated for all models
## Fairness object created succesfully
##
## Fairness check for models: lm
##
## lm passes 3/5 metrics
## Total loss : 1.557322
Or you can compare the metrics for different models.
## Creating fairness classification object
## -> Privileged subgroup : character ( Ok )
## -> Protected variable : factor ( Ok )
## -> Cutoff values for explainers : 0.5 ( for all subgroups )
## -> Fairness objects : 0 objects
## -> Checking explainers : 2 in total ( compatible )
## -> Metric calculation : 10/13 metrics calculated for all models ( 3 NA created )
## Fairness object created succesfully
##
## Fairness check for models: lm, ranger
##
## lm passes 3/5 metrics
## Total loss : 1.557322
##
## ranger passes 4/5 metrics
## Total loss : 1.624424
You can check this value for other variables as well.
## Creating fairness classification object
## -> Privileged subgroup : character ( Ok )
## -> Protected variable : factor ( Ok )
## -> Cutoff values for explainers : 0.5 ( for all subgroups )
## -> Fairness objects : 0 objects
## -> Checking explainers : 2 in total ( compatible )
## -> Metric calculation : 10/13 metrics calculated for all models ( 3 NA created )
## Fairness object created succesfully
https://cran.r-project.org/web/packages/fairness/vignettes/fairness.html https://ashryaagr.github.io/Fairness.jl/dev/datasets/